Using Semantics & Statistics to Turn Data into Knowledge
Authors
Abstract
Many information extraction and knowledge base construction systems address the challenge of deriving knowledge from text. A key problem in constructing these knowledge bases from sources like the web is overcoming the erroneous and incomplete information found in millions of candidate extractions. To solve this problem, we turn to semantics – using ontological constraints between candidate facts to eliminate errors. In this article, we represent the desired knowledge base as a knowledge graph and introduce the problem of knowledge graph identification, collectively resolving the entities, labels, and relations present in the knowledge graph. Knowledge graph identification requires reasoning jointly over millions of extractions simultaneously, posing a scalability challenge to many approaches. We use probabilistic soft logic (PSL), a recently introduced statistical relational learning framework, to implement an efficient solution to knowledge graph identification, and we present state-of-the-art results for knowledge graph construction while performing an order of magnitude faster than competing methods.

A growing body of research focuses on extracting knowledge from text such as news reports, encyclopedic articles, and scholarly research in specialized domains. Much of this data is freely available on the World Wide Web, and harnessing the knowledge contained in millions of web documents remains a problem of particular interest. The scale and diversity of this content pose a formidable challenge for systems designed to extract this knowledge. Many well-known broad-domain and open information extraction systems seek to build knowledge bases from text, including the Never-Ending Language Learning (NELL) project (Carlson et al., 2010), OpenIE (Etzioni et al., 2008), DeepDive (Niu et al., 2012), and efforts at Google (Pasca et al., 2006).
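The soft-logic relaxation underlying PSL can be sketched in a few lines: logical rules over truth values in [0, 1] are relaxed with the Lukasiewicz t-norm, and each grounded rule contributes a "distance to satisfaction" that joint inference minimizes. The rule, its grounding, and all confidence values below are illustrative assumptions, not taken from the article:

```python
# Minimal sketch of PSL-style soft logic (Lukasiewicz relaxation).
# A weighted rule body -> head incurs a penalty max(0, I(body) - I(head)),
# which PSL minimizes jointly over all grounded facts.

def luk_and(a: float, b: float) -> float:
    """Lukasiewicz conjunction: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def luk_not(a: float) -> float:
    """Lukasiewicz negation: 1 - a."""
    return 1.0 - a

def distance_to_satisfaction(body: float, head: float) -> float:
    """How far a soft rule body -> head is from being satisfied."""
    return max(0.0, body - head)

# Hypothetical grounding of the rule:
#   Label(x, male) & MutuallyExclusive(male, female) -> !Label(x, female)
label_male = 0.9     # extractor confidence that x has label "male"
mutex = 1.0          # ontology asserts the two categories are exclusive
label_female = 0.4   # competing candidate extraction for the same x

body = luk_and(label_male, mutex)
penalty = distance_to_satisfaction(body, luk_not(label_female))
# body ~ 0.9 and head ~ 0.6, so this grounding incurs a penalty of ~0.3,
# pushing joint inference to lower one of the two conflicting labels.
```

The hinge-shaped penalty is what makes inference in PSL a convex optimization over continuous truth values rather than a combinatorial search, which is the source of the scalability the article emphasizes.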
Ultimately, these information extraction systems produce a collection of candidate facts that includes a set of entities, attributes of these entities, and the relations between these entities. Information extraction systems use a sophisticated collection of strategies to generate candidate facts from web documents, spanning the syntactic, lexical, and structural features of text (Weikum and Theobald, 2010; Wimalasuriya and Dou, 2010). While these systems are capable of extracting many candidate facts from the web, their output is often hampered by noise. Documents contain inaccurate, outdated, incomplete, or hypothetical information, and the informal and creative language used in web documents is often difficult to interpret. As a result, the candidates produced by information extraction systems often miss key facts and include spurious outputs, compromising the usefulness of the extractions. In an effort to combat such noise, information extraction systems capture a vast array of features and statistics, ranging from the characteristics of the webpages used to generate extractions to the reliability of the particular patterns or techniques used to extract information. Using this host of features and a modest amount of training data, many information extraction systems employ heuristics or learned prediction functions to assign a confidence score to each candidate fact. These confidence scores capture the inherent uncertainty in the text from which the facts were extracted, and can ideally be used to improve the quality of the knowledge base. While many information extraction systems use features derived from text to measure the quality of candidate facts, few take advantage of the many semantic dependencies between these facts.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
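As a hedged sketch of what such candidate facts and confidence scores might look like, one common choice is to combine per-source reliabilities with a noisy-or; the combination rule, names, and numbers here are illustrative assumptions, not the scoring used by NELL or any other particular system:

```python
# A minimal representation of extractor output: (subject, predicate, object)
# triples, each backed by one or more sources (patterns, pages) whose
# reliabilities are combined into a single confidence via a noisy-or.
from dataclasses import dataclass

@dataclass
class CandidateFact:
    subject: str
    predicate: str
    obj: str
    source_scores: list  # reliability of each pattern/page that produced it

    def confidence(self) -> float:
        """Noisy-or: the fact is wrong only if every source is wrong."""
        p_all_wrong = 1.0
        for s in self.source_scores:
            p_all_wrong *= (1.0 - s)
        return 1.0 - p_all_wrong

# Two independent, moderately reliable sources reinforce each other:
fact = CandidateFact("Kyrgyzstan", "isA", "country", [0.6, 0.5])
print(round(fact.confidence(), 2))  # 1 - 0.4 * 0.5 = 0.8
```

The design choice worth noting is that confidences attach to extractions, not to entities: downstream joint reasoning can then weigh conflicting extractions about the same entity against one another.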
For example, many categories, such as “male” and “female”, may be mutually exclusive or restricted to a subset of entities, such as living organisms. Recently, the Semantic Web movement has developed standards and tools to express these dependencies through ontologies designed to capture the diverse information present on the Internet. The problem of building domain-specific ontologies for expert users with Semantic Web tools is challenging and well researched, with high-quality ontologies for domains including bioinformatics, media such as music and books, and governmental data. More general ontologies have been developed for broad collections such as the online encyclopedia Wikipedia.

These semantic constraints are valuable for improving the quality of knowledge bases, but incorporating these dependencies into existing information extraction systems is not straightforward. The constraints imposed by an ontology are generally constraints between facts. For example, candidate facts assigning a particular entity to the categories “male”, “female”, and “living organism” are interrelated. Hence, leveraging the dependencies between facts in a knowledge base requires reasoning jointly about the extracted candidates, and the large scale at which information extraction systems operate makes such joint reasoning a significant scalability challenge.

(Figure: an example knowledge graph fragment relating the candidate entities “Kyrgyzstan” and “Kyrgyz Republic” through the label “country”.)
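As a toy illustration of putting such an ontological constraint to joint use (a deliberately simplified sketch, not the PSL formulation the article describes; the labels, confidences, and helper names are invented), mutually exclusive candidate labels for one entity can be resolved by keeping only the stronger of two conflicting extractions:

```python
# Resolve candidate category labels for a single entity against an
# ontology's mutual-exclusion constraints: when two candidate labels are
# declared exclusive, keep only the higher-confidence one.

MUTEX = {frozenset({"male", "female"})}  # ontology: exclusive category pair

def resolve_labels(candidates: dict) -> dict:
    """candidates maps label -> extractor confidence for one entity."""
    kept = dict(candidates)
    for pair in MUTEX:
        clash = [label for label in pair if label in kept]
        if len(clash) == 2:
            weaker = min(clash, key=lambda label: kept[label])
            del kept[weaker]  # drop the lower-confidence conflicting label
    return kept

# "livingOrganism" conflicts with neither label, so it survives untouched:
print(resolve_labels({"male": 0.7, "female": 0.4, "livingOrganism": 0.9}))
# {'male': 0.7, 'livingOrganism': 0.9}
```

A real joint approach would not hard-delete the weaker label per constraint in isolation; it would trade all constraints and confidences off at once, which is exactly what makes the problem a collective inference task.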
Similar resources
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding
Spoken language understanding (SLU) is a core component of a spoken dialogue system. In the traditional architecture of dialogue systems, the SLU component treats each utterance independently of the others, and the following components then aggregate the multi-turn information in separate phases. However, there are two challenges: 1) errors from previous turns may be propagated and then degra...
Theorizing on the Process of Transferring Theoretical Knowledge into Practice in Nursing: A Grounded Theory Approach
Introduction & Objective: Knowledge transfer, in fact the bridging of theory and practice, is one of the main concerns of all academic disciplines. Prominent professional status is achieved through knowledge-based practice, and a discipline may be called successful when it is able to transfer its theoretical, paradigmatic claims into practice. Accordingly,...
Using Semantics and Statistics to Turn Data into Knowledge
SPRING 2015, 65. A growing body of research focuses on extracting knowledge from text such as news reports, encyclopedic articles, and scholarly research in specialized domains. Much of this data is freely available on the World Wide Web and harnessing the knowledge contained in millions of web documents remains a problem of particular interest. The scale and diversity of this content pose a formi...
Towards Semantics-Enabled Distributed Infrastructure for Knowledge Acquisition
We summarize progress on algorithms and software for knowledge acquisition from large, distributed, autonomous, and semantically disparate information sources. Some key results include: scalable algorithms for constructing predictive models from data based on a novel decomposition of learning algorithms that interleaves queries for sufficient statistics from data with computations using the statist...
Analyzing the problem of meaning in Shabastari's Golshane Raz
Man has always sought a complete model of semantics, a model which, as a paradigm, can affect all branches of science. In the view of the author, such a model can be found in Golshane Raz. Introducing the model from the work mentioned, the paper tries to explain its substructural foundations in three fields: ontology, epistemology, and semantics. Some of the foundat...
Interrogation of University Classrooms in the Court of Semantics: Managerial Implications
The purpose of this article, within the framework of an interpretive study, was to study the semantics of a university's classrooms to create a critical awareness of the meanings of the signs and their functions in the context of physical artifacts, besides their managerial implications. To accomplish this goal, after taking pictures of the structural elements of the studied classroo...
Publication date: 2014